Capstone project

The data science lifecycle mainly involves:

  1. <a href="#data_collection">Data collection</a>
  2. <a href="#data_processing">Data Processing</a>
  3. <a href="#data_viz">Exploratory Data Analysis</a>
  4. <a href="#hypothesis_ML">Hypothesis testing and Machine Learning</a>
  5. <a href="#data_insights">Insight and Policy Decisions</a>

I will use data from a Kaggle Walmart store sales forecasting competition, covering 45 stores, to show the lifecycle of a data science project.

The ability to accurately predict weekly sales for any retail store is crucial in planning for inventory and maximizing efficiency to improve customer experience and profits.

Below are the detailed instructions for the competition: You are provided with historical sales data for 45 Walmart stores located in different regions. Each store contains a number of departments, and you are tasked with predicting the department-wide sales for each store.

In addition, Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labor Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge presented by this competition is modeling the effects of markdowns on these holiday weeks in the absence of complete/ideal historical data.
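Since holiday weeks count five times as much in the evaluation, the scoring metric is a weighted mean absolute error. A minimal sketch of such a metric (toy numbers, assuming weight 5 for holiday weeks and 1 otherwise):

```python
import numpy as np

def wmae(y_true, y_pred, is_holiday):
    # weighted mean absolute error: holiday weeks get weight 5, others weight 1
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    weights = np.where(np.asarray(is_holiday), 5.0, 1.0)
    return np.sum(weights * np.abs(y_true - y_pred)) / np.sum(weights)

# toy example: two non-holiday weeks and one holiday week
score = wmae([100, 200, 300], [110, 190, 330], [False, False, True])
print(score)
```

The holiday error dominates the score, which is why modeling holiday weeks well matters disproportionately.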

In [1]:
#import some of the libraries needed for the project
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from bokeh.charts import Line, Scatter, Bar, TimeSeries, show, output_file, vplot
from bokeh.io import output_notebook
output_notebook()
Loading BokehJS ...

1. Data Collection

Features.csv file description: This file contains additional data related to the store, department, and regional activity for the given dates. It contains the following fields:

  1. Store - the store number
  2. Date - the week
  3. Temperature - average temperature in the region
  4. Fuel_Price - cost of fuel in the region
  5. MarkDown1-5 - anonymized data related to promotional markdowns that Walmart is running. MarkDown data is only available after Nov 2011, and is not available for all stores all the time. Any missing value is marked with an NA.
  6. CPI - the consumer price index
  7. Unemployment - the unemployment rate
  8. IsHoliday - whether the week is a special holiday week
In [2]:
features = pd.read_csv("features.csv")
features.head()
Out[2]:
Store Date Temperature Fuel_Price MarkDown1 MarkDown2 MarkDown3 MarkDown4 MarkDown5 CPI Unemployment IsHoliday
0 1 2010-02-05 42.31 2.572 NaN NaN NaN NaN NaN 211.096358 8.106 False
1 1 2010-02-12 38.51 2.548 NaN NaN NaN NaN NaN 211.242170 8.106 True
2 1 2010-02-19 39.93 2.514 NaN NaN NaN NaN NaN 211.289143 8.106 False
3 1 2010-02-26 46.63 2.561 NaN NaN NaN NaN NaN 211.319643 8.106 False
4 1 2010-03-05 46.50 2.625 NaN NaN NaN NaN NaN 211.350143 8.106 False
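Since MarkDown values are only available after Nov 2011, it helps to quantify missingness before processing. A small sketch, using a toy frame (hypothetical values) in place of the real features table:

```python
import pandas as pd
import numpy as np

# toy frame standing in for features.csv: MarkDown1 mostly missing
features_toy = pd.DataFrame({
    "Store": [1, 1, 1, 1],
    "MarkDown1": [np.nan, np.nan, np.nan, 5000.0],
})

# fraction of missing values per column
missing = features_toy.isna().mean()
print(missing)
```

Running the same `isna().mean()` on the real features frame shows which columns need imputation or special handling.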

Stores.csv file description: This file contains anonymized information about the 45 stores, indicating the type and size of each store.

In [3]:
stores = pd.read_csv("stores.csv")
stores.head()
Out[3]:
Store Type Size
0 1 A 151315
1 2 A 202307
2 3 B 37392
3 4 A 205863
4 5 B 34875

Train.csv file description: This is the historical training data, which covers 2010-02-05 to 2012-11-01. Within this file you will find the following fields:

  1. Store - the store number
  2. Dept - the department number
  3. Date - the week
  4. Weekly_Sales - sales for the given department in the given store
  5. IsHoliday - whether the week is a special holiday week
In [4]:
train = pd.read_csv("train.csv")
train.head()
Out[4]:
Store Dept Date Weekly_Sales IsHoliday
0 1 1 2010-02-05 24924.50 False
1 1 1 2010-02-12 46039.49 True
2 1 1 2010-02-19 41595.55 False
3 1 1 2010-02-26 19403.54 False
4 1 1 2010-03-05 21827.90 False

Test.csv file description: This file is identical to train.csv, except the weekly sales have been withheld. You must predict the sales for each triplet of store, department, and date in this file.

In [5]:
test = pd.read_csv("test.csv")
test.head()
Out[5]:
Store Dept Date IsHoliday
0 1 1 2012-11-02 False
1 1 1 2012-11-09 False
2 1 1 2012-11-16 False
3 1 1 2012-11-23 True
4 1 1 2012-11-30 False

2. Data Processing

Split the dates of all tables into year, month and day.

In [6]:
# Splits a column of date strings into years, months and days
# parameters: dates - pandas Series containing dates in the format year-month-day
# returns: a tuple of lists in the form (year, month, day)
def split_date(dates):
    year = []
    month = []
    day = []
    for row in dates:
        new_date = row.split("-")
        year.append(int(new_date[0]))
        month.append(int(new_date[1]))
        day.append(int(new_date[2]))

    return (year, month, day)
In [7]:
#Split the dates for the test table into year, month and day
year, month, day = split_date(test["Date"])
test["Year"] = year
test["Month"] = month
test["Day"] = day
test.head()
Out[7]:
Store Dept Date IsHoliday Year Month Day
0 1 1 2012-11-02 False 2012 11 2
1 1 1 2012-11-09 False 2012 11 9
2 1 1 2012-11-16 False 2012 11 16
3 1 1 2012-11-23 True 2012 11 23
4 1 1 2012-11-30 False 2012 11 30
In [8]:
#Split the dates for the train table into year, month and day
year, month, day = split_date(train["Date"])
train["Year"] = year
train["Month"] = month
train["Day"] = day
train.head()
Out[8]:
Store Dept Date Weekly_Sales IsHoliday Year Month Day
0 1 1 2010-02-05 24924.50 False 2010 2 5
1 1 1 2010-02-12 46039.49 True 2010 2 12
2 1 1 2010-02-19 41595.55 False 2010 2 19
3 1 1 2010-02-26 19403.54 False 2010 2 26
4 1 1 2010-03-05 21827.90 False 2010 3 5
In [9]:
year, month, day = split_date(features["Date"])
features["Year"] = year
features["Month"] = month
features["Day"] = day
features.head()
Out[9]:
Store Date Temperature Fuel_Price MarkDown1 MarkDown2 MarkDown3 MarkDown4 MarkDown5 CPI Unemployment IsHoliday Year Month Day
0 1 2010-02-05 42.31 2.572 NaN NaN NaN NaN NaN 211.096358 8.106 False 2010 2 5
1 1 2010-02-12 38.51 2.548 NaN NaN NaN NaN NaN 211.242170 8.106 True 2010 2 12
2 1 2010-02-19 39.93 2.514 NaN NaN NaN NaN NaN 211.289143 8.106 False 2010 2 19
3 1 2010-02-26 46.63 2.561 NaN NaN NaN NaN NaN 211.319643 8.106 False 2010 2 26
4 1 2010-03-05 46.50 2.625 NaN NaN NaN NaN NaN 211.350143 8.106 False 2010 3 5
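The same split can also be done without an explicit loop, using pandas datetime accessors. A vectorized sketch (toy dates standing in for the real Date columns):

```python
import pandas as pd

# toy dates in the same year-month-day format as the Date columns above
df_dates = pd.DataFrame({"Date": ["2010-02-05", "2010-02-12"]})

# parse once, then pull the components out vectorized
parsed = pd.to_datetime(df_dates["Date"], yearfirst=True)
df_dates["Year"] = parsed.dt.year
df_dates["Month"] = parsed.dt.month
df_dates["Day"] = parsed.dt.day
print(df_dates)
```

Both approaches produce the same Year/Month/Day columns; the vectorized form avoids Python-level iteration on the full tables.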

We also need to determine whether any relationships exist between the sales data and the features stored in other tables. To make that analysis easier, I will left join the train table with the stores and features tables, adding their columns to the training data.

Left join the train (left) table with the features (right) table.

In [10]:
train_data = train.merge(features, on =['Store', 'IsHoliday', 'Year', 'Month', 'Day'], how = 'left')
train_data.head()    
Out[10]:
Store Dept Date_x Weekly_Sales IsHoliday Year Month Day Date_y Temperature Fuel_Price MarkDown1 MarkDown2 MarkDown3 MarkDown4 MarkDown5 CPI Unemployment
0 1 1 2010-02-05 24924.50 False 2010 2 5 2010-02-05 42.31 2.572 NaN NaN NaN NaN NaN 211.096358 8.106
1 1 1 2010-02-12 46039.49 True 2010 2 12 2010-02-12 38.51 2.548 NaN NaN NaN NaN NaN 211.242170 8.106
2 1 1 2010-02-19 41595.55 False 2010 2 19 2010-02-19 39.93 2.514 NaN NaN NaN NaN NaN 211.289143 8.106
3 1 1 2010-02-26 19403.54 False 2010 2 26 2010-02-26 46.63 2.561 NaN NaN NaN NaN NaN 211.319643 8.106
4 1 1 2010-03-05 21827.90 False 2010 3 5 2010-03-05 46.50 2.625 NaN NaN NaN NaN NaN 211.350143 8.106

Delete the unnecessary duplicate Date_x column (Date_y holds the same dates and is kept for the time-series plot later)

In [11]:
del train_data['Date_x']
train_data.columns
Out[11]:
Index(['Store', 'Dept', 'Weekly_Sales', 'IsHoliday', 'Year', 'Month', 'Day',
       'Date_y', 'Temperature', 'Fuel_Price', 'MarkDown1', 'MarkDown2',
       'MarkDown3', 'MarkDown4', 'MarkDown5', 'CPI', 'Unemployment'],
      dtype='object')

Left join the train_data (left) table with the stores (right) table. This adds the store type and size columns, giving us more features for analyzing sales.

In [12]:
train_data = train_data.merge(stores, on = 'Store', how = 'left')
train_data.head()
Out[12]:
Store Dept Weekly_Sales IsHoliday Year Month Day Date_y Temperature Fuel_Price MarkDown1 MarkDown2 MarkDown3 MarkDown4 MarkDown5 CPI Unemployment Type Size
0 1 1 24924.50 False 2010 2 5 2010-02-05 42.31 2.572 NaN NaN NaN NaN NaN 211.096358 8.106 A 151315
1 1 1 46039.49 True 2010 2 12 2010-02-12 38.51 2.548 NaN NaN NaN NaN NaN 211.242170 8.106 A 151315
2 1 1 41595.55 False 2010 2 19 2010-02-19 39.93 2.514 NaN NaN NaN NaN NaN 211.289143 8.106 A 151315
3 1 1 19403.54 False 2010 2 26 2010-02-26 46.63 2.561 NaN NaN NaN NaN NaN 211.319643 8.106 A 151315
4 1 1 21827.90 False 2010 3 5 2010-03-05 46.50 2.625 NaN NaN NaN NaN NaN 211.350143 8.106 A 151315

Change the column data types to types that allow for easier analysis and manipulation of the data:

  1. Change the Date_y column from string to datetime: train_data['Date_y'] = pd.to_datetime(train_data['Date_y'], yearfirst = True)
  2. Change the following columns from string to numeric data types: Store, Dept, Fuel_Price, Weekly_Sales, MarkDown1 to MarkDown5, CPI, Unemployment, Size
  3. Change the IsHoliday column from boolean to binary values: 1 represents True and 0 represents False
  4. Change the store Type column from letters to corresponding numbers
In [13]:
#1. type changed from string to datetime type
train_data['Date_y'] = pd.to_datetime(train_data['Date_y'], yearfirst = True)

#2. type changed from string to numeric
train_data[['Store','Dept','Weekly_Sales', 'Temperature']] = train_data[['Store','Dept','Weekly_Sales', 'Temperature']].apply(pd.to_numeric)
train_data[['Fuel_Price','CPI','Unemployment', 'Size']] = train_data[['Fuel_Price','CPI','Unemployment', 'Size']].apply(pd.to_numeric)
train_data[['MarkDown1','MarkDown2','MarkDown3','MarkDown4','MarkDown5']] = train_data[['MarkDown1','MarkDown2','MarkDown3','MarkDown4','MarkDown5']].apply(pd.to_numeric)

#3. Change IsHoliday from boolean to binary
#4. Change Type column to numeric values
holiday = []
types = []
for index, row in train_data.iterrows():
    if row['IsHoliday'] == True:
        holiday.append(1)
    else:
        holiday.append(0)

    if row['Type'] == 'A':
        types.append(1)
    elif row['Type'] == 'B':
        types.append(2)
    elif row['Type'] == 'C':
        types.append(3)
    else:
        types.append(4)

train_data['Holiday'] = holiday
train_data['Type_n'] = types

train_data.head()
train_data.head()
Out[13]:
Store Dept Weekly_Sales IsHoliday Year Month Day Date_y Temperature Fuel_Price ... MarkDown2 MarkDown3 MarkDown4 MarkDown5 CPI Unemployment Type Size Holiday Type_n
0 1 1 24924.50 False 2010 2 5 2010-02-05 42.31 2.572 ... NaN NaN NaN NaN 211.096358 8.106 A 151315 0 1
1 1 1 46039.49 True 2010 2 12 2010-02-12 38.51 2.548 ... NaN NaN NaN NaN 211.242170 8.106 A 151315 1 1
2 1 1 41595.55 False 2010 2 19 2010-02-19 39.93 2.514 ... NaN NaN NaN NaN 211.289143 8.106 A 151315 0 1
3 1 1 19403.54 False 2010 2 26 2010-02-26 46.63 2.561 ... NaN NaN NaN NaN 211.319643 8.106 A 151315 0 1
4 1 1 21827.90 False 2010 3 5 2010-03-05 46.50 2.625 ... NaN NaN NaN NaN 211.350143 8.106 A 151315 0 1

5 rows × 21 columns
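For reference, the row-wise loop can also be expressed in vectorized pandas, which scales better on the full training frame. A sketch on a toy frame with the same column names:

```python
import pandas as pd

# toy frame with the same columns the loop converts
df_conv = pd.DataFrame({
    "IsHoliday": [False, True, False],
    "Type": ["A", "B", "C"],
})

# boolean -> 0/1 without iterating over rows
df_conv["Holiday"] = df_conv["IsHoliday"].astype(int)

# letter codes -> numbers; any letter outside A-C would become NaN here
df_conv["Type_n"] = df_conv["Type"].map({"A": 1, "B": 2, "C": 3})
print(df_conv)
```

`astype` and `map` do the same conversion as the loop, minus the per-row Python overhead of `iterrows()`.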

3. Exploratory Data Analysis and Visualization

We are going to reduce the dataframe to a random sample of 30,000 rows for easier visualization and manipulation, because the full dataframe requires significant computing power.

In [14]:
#save the current dataframe to another variable
train_df = train_data.copy(deep = True)
train_df = train_df.sample(n = 30000)
train_df.head()
Out[14]:
Store Dept Weekly_Sales IsHoliday Year Month Day Date_y Temperature Fuel_Price ... MarkDown2 MarkDown3 MarkDown4 MarkDown5 CPI Unemployment Type Size Holiday Type_n
336098 35 42 8765.73 True 2011 2 11 2011-02-11 30.45 3.239 ... NaN NaN NaN NaN 137.137832 8.549 B 103681 1 2
372933 40 14 9785.84 False 2012 5 18 2012-05-18 57.59 3.899 ... NaN 103.83 735.81 2388.29 138.106581 4.125 A 155083 0 1
267571 28 5 25573.29 False 2010 6 18 2010-06-18 81.06 3.043 ... NaN NaN NaN NaN 126.114000 14.099 A 206302 0 1
249969 26 23 16072.94 False 2011 10 28 2011-10-28 35.06 3.604 ... NaN NaN NaN NaN 136.488452 7.598 A 152513 0 1
21255 3 6 2259.99 False 2011 3 18 2011-03-18 69.47 3.488 ... NaN NaN NaN NaN 217.723523 7.551 B 37392 0 2

5 rows × 21 columns

In [41]:
t = TimeSeries(train_df,
    x='Date_y', y=['Weekly_Sales'],
    title="Timeseries", ylabel='Weekly Sales', legend=True)

output_file("weeklysales.html")
show(t)
INFO:bokeh.core.state:Session output file 'weeklysales.html' already exists, will be overwritten.

The graph above clearly shows spikes in sales during specific times of the year.

Draw a bar chart of the average monthly sales for Walmart stores.

In [16]:
#create dataframe to hold the mean monthly sales
monthly_sales = train_df.groupby(['Month'])['Weekly_Sales'].mean()
monthly_sales = monthly_sales.to_frame()
monthly_sales.head()
Out[16]:
Weekly_Sales
Month
1 13690.901684
2 15845.684514
3 15621.023918
4 15789.103753
5 16631.566204
In [17]:
monthly_sales_bar = Bar(monthly_sales, values='Weekly_Sales', title="Mean Monthly Sales")
output_file("Mean_monthly_sales.html")
show(monthly_sales_bar)
INFO:bokeh.core.state:Session output file 'Mean_monthly_sales.html' already exists, will be overwritten.

From the data, December has higher average weekly sales than any other month.

Monthly sales by type of store

In [18]:
monthly_sales_bytype = train_df.groupby(['Month','Type'])['Weekly_Sales'].mean()
monthly_sales_bytype = monthly_sales_bytype.to_frame() 
monthly_sales_bytype.reset_index(inplace=True)
monthly_sales_bytype.head()
Out[18]:
Month Type Weekly_Sales
0 1 A 16607.087261
1 1 B 10613.629543
2 1 C 10488.417500
3 2 A 20146.899831
4 2 B 11379.468453
In [19]:
monthly_sales_bytype_bar = Bar(monthly_sales_bytype, label = 'Month', values='Weekly_Sales', group = 'Type',  title="Mean Monthly Sales By Type of Store")
output_file("Mean_monthly_sales_bytype.html")
show(monthly_sales_bytype_bar)
INFO:bokeh.core.state:Session output file 'Mean_monthly_sales_bytype.html' already exists, will be overwritten.

From the bar chart above it is clear that stores of type A have the highest average weekly sales, followed by stores of type B, then type C.

In [20]:
p = Scatter(train_df, x='Temperature', y='Weekly_Sales',  title="Temperature vs Weekly Sales",
            xlabel="Temperature in Degrees Fahrenheit", ylabel="Weekly Sales")

output_file("temperature_sales_scatter.html")

show(p)
INFO:bokeh.core.state:Session output file 'temperature_sales_scatter.html' already exists, will be overwritten.

From the scatter plot above, there appears to be no relationship between temperature and weekly sales.

Is there a relationship between Walmart store size and weekly sales?

In [21]:
store_size = train_df.groupby(['Size'])["Weekly_Sales"].mean()
store_size = store_size.to_frame()
store_size.reset_index(inplace=True)
store_size.head()
Out[21]:
Size Weekly_Sales
0 34875 5029.343401
1 37392 5715.923182
2 39690 8132.954126
3 39910 8015.691678
4 41062 13429.191404
In [22]:
p = Scatter(store_size, x='Size', y='Weekly_Sales',  title="Store size vs Weekly Sales",
            xlabel="Store size in sqft", ylabel="Weekly Sales")

output_file("store_size_sales_scatter.html")

show(p)
INFO:bokeh.core.state:Session output file 'store_size_sales_scatter.html' already exists, will be overwritten.

From the scatter plot, there appears to be a linear relationship between store size and average weekly sales.
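One quick way to quantify such an apparent linearity is a least-squares line fit. A sketch on hypothetical (size, mean sales) pairs, not the real data:

```python
import numpy as np

# hypothetical (store size, mean weekly sales) pairs for illustration only
size = np.array([34875.0, 37392.0, 39690.0, 151315.0, 202307.0])
sales = np.array([5029.0, 5716.0, 8133.0, 21000.0, 27000.0])

# least-squares line: sales ~ slope * size + intercept
slope, intercept = np.polyfit(size, sales, 1)
print(slope, intercept)
```

A clearly positive slope would support the visual impression from the scatter plot; on the real data one could also compute the correlation coefficient as a second check.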

Is there a relationship between fuel price and Weekly sales?

In [23]:
p = Scatter(train_df, x='Fuel_Price', y='Weekly_Sales',  title="Fuel Price vs Weekly Sales",
            xlabel="Fuel Price in dollars/gallon", ylabel="Weekly Sales")

output_file("fuel_price_sales_scatter.html")

show(p)
INFO:bokeh.core.state:Session output file 'fuel_price_sales_scatter.html' already exists, will be overwritten.

Conclusion: From the plot above, there is no apparent relationship between fuel price and weekly sales.

Is there a relationship between Consumer Price Index(CPI) and Weekly sales?

In [24]:
p = Scatter(train_df, x='CPI', y='Weekly_Sales',  title="Consumer Price Index vs Weekly Sales",
            xlabel="Consumer Price Index", ylabel="Weekly Sales")

output_file("CPI_sales_scatter.html")

show(p)
INFO:bokeh.core.state:Session output file 'CPI_sales_scatter.html' already exists, will be overwritten.

From the scatter plot above, there is no apparent relationship between the Consumer Price Index and weekly sales.

Is there a relationship between Unemployment and Weekly sales?

In [25]:
p = Scatter(train_df, x='Unemployment', y='Weekly_Sales',  title="Unemployment vs Weekly Sales",
            xlabel="Unemployment", ylabel="Weekly Sales")

output_file("unemployment_sales_scatter.html")

show(p)
INFO:bokeh.core.state:Session output file 'unemployment_sales_scatter.html' already exists, will be overwritten.

From the scatter plot above, there is no apparent relationship between unemployment and weekly sales.

Is there a relationship between holidays and weekly sales?

In [26]:
holiday_df = train_df.groupby(['Holiday'])['Weekly_Sales'].mean()
holiday_df = holiday_df.to_frame()
holiday_df.reset_index(inplace=True)
holiday_df.head()
Out[26]:
Holiday Weekly_Sales
0 0 16086.639208
1 1 16484.432381
In [27]:
holiday_df_bar = Bar(holiday_df, values='Weekly_Sales',  title="Mean Holiday Sales")
output_file("Mean_holiday_sales.html")
show(holiday_df_bar)
INFO:bokeh.core.state:Session output file 'Mean_holiday_sales.html' already exists, will be overwritten.

Conclusion: From the bar graph above, holiday weeks have higher average sales than non-holiday weeks.

Is there a relationship between Store department and weekly sales?

In [28]:
dept_df = train_df.groupby(['Dept'])['Weekly_Sales'].mean()
dept_df = dept_df.to_frame()
dept_df.reset_index(inplace=True)
dept_df.head()
Out[28]:
Dept Weekly_Sales
0 1 20311.418150
1 2 43997.007167
2 3 12162.144145
3 4 25665.706198
4 5 20592.130072
In [29]:
dept_df_bar = Bar(dept_df, values='Weekly_Sales',  title="Mean Department Sales")
output_file("Mean_department_sales.html")
show(dept_df_bar)
INFO:bokeh.core.state:Session output file 'Mean_department_sales.html' already exists, will be overwritten.

From the bar chart above, there appears to be a strong relationship between store department and weekly sales.

Summary of Exploratory Data Analysis

Store size, holiday, store type, store department, and month of the year have a strong effect on weekly sales, while other factors such as temperature, Consumer Price Index (CPI), fuel price, and unemployment have little to no impact on weekly sales.

4. Analysis, Hypothesis testing and Machine Learning

In [30]:
#import libraries that will be used for machine learning
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

From the exploratory data analysis, store size, holiday, store type, store department, and month of the year have an impact on a store's weekly sales, so we will consider only these factors when predicting weekly sales. I will use a KNN regressor to predict future weekly sales: for each value to be predicted, it first finds the nearest neighbors based on store, department, holiday, month, and year, then predicts the weekly sales as a distance-weighted average of those neighbors' sales.

Below is a simple example of a KNN regressor.

In [31]:
#sample input X and y
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]

#knn regressor model using the 2 nearest neighbors
neigh = KNeighborsRegressor(n_neighbors=2)
neigh.fit(X, y)

#predict value based on model
print(neigh.predict([[1.5]]))
[ 0.5]

How many nearest neighbors should we use in the regression? I will measure the accuracy of the weekly sales predictions and choose the optimal number of nearest neighbors, using K-fold cross-validation to score each choice.

In [32]:
# X values represent the values used to predict the weekly sales
# y is the weekly sales which is value to be predicted
X = train_df[['Store', 'Dept', 'IsHoliday', 'Year', 'Month']]
y = train_df[['Weekly_Sales']]

#array keeps track of the score for each number of nearest neighbors
scores = []
for n in range(1,12):
    neigh = KNeighborsRegressor(n_neighbors=n, weights = 'distance')
    score = cross_val_score(neigh, X, y, cv = 10)
    scores.append([n, score.mean()])

#convert array to dataframe 
scores = pd.DataFrame(data = scores,columns = ['neighbors', 'cross_val_score'])
scores
Out[32]:
neighbors cross_val_score
0 1 0.428224
1 2 0.585116
2 3 0.613985
3 4 0.622043
4 5 0.616629
5 6 0.612779
6 7 0.605415
7 8 0.598018
8 9 0.593538
9 10 0.587940
10 11 0.582610

Plot the results from the table to find the optimal number of nearest neighbors.

In [33]:
line = Line(scores, x='neighbors',y='cross_val_score', title="K-Nearest neighbors vs cross validation score", legend="top_left", xlabel = 'k-nearest neighbor', ylabel='cross validation score')
output_file("neighbors_score.html")
show(line)
INFO:bokeh.core.state:Session output file 'neighbors_score.html' already exists, will be overwritten.

Conclusion: From the graph above, the cross-validation score peaks when the number of nearest neighbors is 4.

What is the optimal number of folds for K-fold cross-validation? The first step is to measure prediction accuracy using different numbers of folds.

In [34]:
scores_kfold = []
for kfold_size in range(3,30):
    neigh = KNeighborsRegressor(n_neighbors=4, weights = 'distance')
    score = cross_val_score(neigh, X, y, cv = kfold_size)
    
    scores_kfold.append([kfold_size, score.mean()])
    
scores_kfold_df = pd.DataFrame(data = scores_kfold,columns = ['kfold_size', 'cross_val_score'])
scores_kfold_df
Out[34]:
kfold_size cross_val_score
0 3 0.567496
1 4 0.587506
2 5 0.598334
3 6 0.608562
4 7 0.607180
5 8 0.617994
6 9 0.621089
7 10 0.622043
8 11 0.629222
9 12 0.624452
10 13 0.625990
11 14 0.627783
12 15 0.630670
13 16 0.630800
14 17 0.636557
15 18 0.628562
16 19 0.631969
17 20 0.628799
18 21 0.630954
19 22 0.634465
20 23 0.632282
21 24 0.634921
22 25 0.631537
23 26 0.635553
24 27 0.634569
25 28 0.639114
26 29 0.632023

The second step is to plot the results.

In [35]:
line = Line(scores_kfold_df, x='kfold_size',y='cross_val_score', title="Kfold sizes vs cross validation score", legend="top_left", xlabel = 'kfold sizes', ylabel='cross validation score')
output_file("kfold_sizes_score.html")
show(line)
INFO:bokeh.core.state:Session output file 'kfold_sizes_score.html' already exists, will be overwritten.

Observation: the score peaks when the number of folds is around 20, after which the graph levels off. Conclusion: use 20 folds for cross-validation, because increasing the number of folds beyond 20 yields no significant improvement in the cross-validation score.

5. Insight and Policy Decisions

In [36]:
from IPython.display import HTML

We can now use our K-nearest neighbors regressor model to predict future weekly sales, which helps us learn more about expected demand.

In [37]:
# X values represent the values used to predict the weekly sales
# y is the weekly sales which is value to be predicted
X = train_df[['Store', 'Dept', 'IsHoliday', 'Year', 'Month']]
y = train_df[['Weekly_Sales']]
neigh = KNeighborsRegressor(n_neighbors=4, weights = 'distance')
neigh.fit(X, y) 

X_test = test[['Store', 'Dept', 'IsHoliday', 'Year', 'Month']]
test_predict = neigh.predict(X_test)
test['Weekly_Sales_Predictions'] = test_predict
test.head(25)
Out[37]:
Store Dept Date IsHoliday Year Month Day Weekly_Sales_Predictions
0 1 1 2012-11-02 False 2012 11 2 38808.609103
1 1 1 2012-11-09 False 2012 11 9 38808.609103
2 1 1 2012-11-16 False 2012 11 16 38808.609103
3 1 1 2012-11-23 True 2012 11 23 40141.870925
4 1 1 2012-11-30 False 2012 11 30 38808.609103
5 1 1 2012-12-07 False 2012 12 7 43268.107314
6 1 1 2012-12-14 False 2012 12 14 43268.107314
7 1 1 2012-12-21 False 2012 12 21 43268.107314
8 1 1 2012-12-28 True 2012 12 28 43987.942941
9 1 1 2013-01-04 False 2013 1 4 34278.951545
10 1 1 2013-01-11 False 2013 1 11 34278.951545
11 1 1 2013-01-18 False 2013 1 18 34278.951545
12 1 1 2013-01-25 False 2013 1 25 34278.951545
13 1 1 2013-02-01 False 2013 2 1 36261.364928
14 1 1 2013-02-08 True 2013 2 8 37457.053447
15 1 1 2013-02-15 False 2013 2 15 36261.364928
16 1 1 2013-02-22 False 2013 2 22 36261.364928
17 1 1 2013-03-01 False 2013 3 1 31958.998085
18 1 1 2013-03-08 False 2013 3 8 31958.998085
19 1 1 2013-03-15 False 2013 3 15 31958.998085
20 1 1 2013-03-22 False 2013 3 22 31958.998085
21 1 1 2013-03-29 False 2013 3 29 31958.998085
22 1 1 2013-04-05 False 2013 4 5 39872.134366
23 1 1 2013-04-12 False 2013 4 12 39872.134366
24 1 1 2013-04-19 False 2013 4 19 39872.134366

From the exploratory data analysis we realized that stores of type A produce significantly higher sales than type B and type C stores. In order to increase sales, we should focus our efforts on type A stores, since they are the company's biggest revenue drivers.

In [38]:
HTML(filename='Mean_monthly_sales_bytype.html')
Out[38]:
Bokeh Plot

Monthly sales also differ depending on the store. We should expect our highest sales during the months of November and December, while our lowest sales will probably be in the months of March and September.

In [39]:
HTML(filename='Mean_monthly_sales.html')
Out[39]:
Bokeh Plot

The scatter plot below shows that sales should increase as store size increases. Management should expect higher sales from larger stores and can use this data to determine which stores use their size efficiently to increase weekly sales.

In [40]:
HTML(filename="store_size_sales_scatter.html")
Out[40]:
Bokeh Plot